aregmi.net

How AI Agents with LLMs Actually Work: A Technical Deep Dive

Introduction

"AI Agent" has become an overloaded term. Prompting patterns like ReAct and ReWOO, and frameworks like LangChain and CrewAI, introduce layers of abstraction that can obscure a simple underlying mechanism. This document strips away those abstractions and explains what is actually happening at a technical level when an LLM-based agent runs.


The Fundamental Primitive: Next-Token Prediction

At its core, a Large Language Model (LLM) is a next-token predictor. Given a sequence of tokens, the transformer architecture outputs a probability distribution over the entire vocabulary for the next token.

P(token_{n+1} | token_1, token_2, ..., token_n)

Text is generated autoregressively: sample a token from that distribution, append it to the sequence, and repeat until a stop condition is met (stop token, max length, etc.).
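
The sample-append-repeat cycle can be sketched in a few lines. This is a toy illustration, not a real model: VOCAB and TOY_LOGITS are hypothetical, and the "model" here conditions only on the last token, whereas a transformer conditions on the whole sequence.

```python
import math
import random

# Hypothetical toy "model": logits over a tiny vocabulary, keyed by the last token.
VOCAB = ["the", "cat", "sat", "<stop>"]
TOY_LOGITS = {
    None: [3.0, 0.0, 0.0, -5.0],
    "the": [-5.0, 3.0, 0.0, -5.0],
    "cat": [-5.0, -5.0, 3.0, 0.0],
    "sat": [-5.0, -5.0, -5.0, 3.0],
}

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(max_tokens=10, temperature=1.0, rng=None):
    """Autoregressive loop: sample a token, append it, repeat until a stop condition."""
    rng = rng or random.Random(0)
    tokens = []
    for _ in range(max_tokens):  # stop condition 1: max length
        last = tokens[-1] if tokens else None
        probs = softmax(TOY_LOGITS[last], temperature)
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if next_token == "<stop>":  # stop condition 2: stop token
            break
        tokens.append(next_token)
    return tokens
```

Lowering the temperature sharpens the distribution toward the highest-logit token; raising it makes the less likely tokens more probable.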

Every "intelligent" behavior you observe from an agent -- reasoning, planning, tool use, self-correction -- emerges from this single operation applied repeatedly with carefully structured input context.


The Agent Loop: What Is Actually Executing

An "agent" is not a new kind of AI. It is a conventional program loop that repeatedly calls an LLM with an ever-growing context. Here is what every agent framework reduces to:

def agent_loop(user_query, tools, system_prompt, max_iterations=10):
    # `llm` and `execute_tool` are placeholders for a model client and a tool dispatcher.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query}
    ]

    for _ in range(max_iterations):
        # Step 1: Call the LLM (one generation call; internally, one forward
        # pass through the transformer per output token)
        response = llm.generate(messages, tools=tools)

        if response.has_tool_calls():
            # Step 2a: LLM emitted tool calls -- record the assistant turn once,
            # then execute each call and append its result
            messages.append({"role": "assistant", "tool_calls": response.tool_calls})
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call.name, tool_call.arguments)
                messages.append({"role": "tool", "content": result})
            # Loop continues: LLM will see tool results on next iteration
        else:
            # Step 2b: LLM emitted a final text answer -- loop ends
            return response.text

    return "Max iterations reached"

There is no autonomous decision to "keep going." The orchestrator code inspects whether the LLM's output contains a tool call. If yes, it executes the tool, appends the result to the message history, and calls the LLM again. If no, it returns the response. That is the entire mechanism.
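
To watch the mechanism run end to end without a real model, here is a minimal sketch that swaps the LLM for a scripted stand-in. ScriptedLLM, ToolCall, Response, and run_agent are hypothetical names introduced for illustration; the model client is passed in as a parameter so the example is self-contained.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class Response:
    text: str = ""
    tool_calls: list = field(default_factory=list)

class ScriptedLLM:
    """Hypothetical stand-in for a model client: replays canned responses in order."""
    def __init__(self, responses):
        self._responses = iter(responses)
        self.calls = 0

    def generate(self, messages):
        self.calls += 1
        return next(self._responses)

def run_agent(llm, tools, user_query, max_iterations=10):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_iterations):
        response = llm.generate(messages)
        if not response.tool_calls:
            # No tool call in the output: the orchestrator stops looping.
            return response.text, messages
        messages.append({"role": "assistant", "tool_calls": response.tool_calls})
        for call in response.tool_calls:
            result = tools[call.name](**call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": str(result)})
    return "Max iterations reached", messages

tools = {"add": lambda a, b: a + b}
llm = ScriptedLLM([
    Response(tool_calls=[ToolCall("add", {"a": 2, "b": 3})]),  # iteration 1: tool call
    Response(text="2 + 3 = 5"),                                # iteration 2: final answer
])
answer, history = run_agent(llm, tools, "What is 2 + 3?")
```

The scripted model is called exactly twice: once to emit the tool call, and once more after the tool result has been appended to the history.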


How Tool Calling Works Under the Hood

Training Phase

During fine-tuning (supervised fine-tuning + RLHF), models are trained on examples where the context contains tool definitions and the target output is a structured tool call such as:

{
  "name": "search_database",
  "arguments": {
    "query": "quarterly revenue 2024",
    "limit": 5
  }
}

The model learns the statistical correlation: when the context contains tool definitions and the query requires information the model does not have, generating a tool-call token sequence leads to trajectories that were rewarded during training.

Inference Phase

  1. Tool definitions are injected into the prompt (either in the system message or via a dedicated API field).
  2. The model generates tokens. If its output matches the tool-call format, the API (or orchestrator) parses it as a structured tool invocation rather than plain text.
  3. A special delimiter or stop reason (e.g., finish_reason: "tool_calls") signals to the orchestrator that execution should pause for tool handling.
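
The parsing step in the list above might look like the following sketch. It assumes the model emits the tool call as a bare JSON object in the shape shown under Training Phase; SEARCH_TOOL and parse_tool_call are hypothetical names, and hosted APIs usually do this parsing for you and return structured fields instead.

```python
import json

# Hypothetical tool definition in JSON Schema style.
SEARCH_TOOL = {
    "name": "search_database",
    "description": "Search internal records.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["query"],
    },
}

def parse_tool_call(raw_output, tool_defs):
    """Return (name, arguments) if raw_output is a valid tool call, else None."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # plain text, not a tool call
    if not isinstance(payload, dict):
        return None
    known = {tool["name"]: tool for tool in tool_defs}
    tool = known.get(payload.get("name"))
    args = payload.get("arguments")
    if tool is None or not isinstance(args, dict):
        return None  # unknown tool or malformed arguments
    missing = [key for key in tool["parameters"]["required"] if key not in args]
    if missing:
        return None  # a required argument is absent
    return payload["name"], args
```

Returning None for anything that does not match the format is what lets the orchestrator treat the output as a plain-text answer.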

The model does not "understand" tools in a semantic sense. It has learned token patterns that correspond to successful tool usage from training data.


Prompting Frameworks Demystified

ReAct (Reason + Act)

ReAct is a prompting pattern, not a runtime system. It structures the model's output into an alternating cycle:

Thought: I need to find the user's order history to answer this question.
Action: search_orders(user_id="12345")
Observation: [Tool returns 3 orders from the last 30 days]
Thought: I now have the order data. The most recent order was placed on...
Answer: Your most recent order was...

Why it works: The "Thought" step forces the model to emit intermediate reasoning tokens before deciding on an action. This is chain-of-thought prompting: the intermediate tokens become part of the context that every later token is conditioned on, which empirically improves accuracy on the subsequent action and answer tokens. The "Observation" is just the tool result appended by the orchestrator.
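
On the orchestrator side, a ReAct step is plain text parsing. The sketch below assumes the exact Thought/Action/Observation line format shown above, with keyword arguments given as double-quoted strings; parse_action and react_step are hypothetical helpers, not part of any framework.

```python
import re

# Matches a line such as: Action: search_orders(user_id="12345")
ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)
KWARG_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_action(model_output):
    """Extract (tool_name, kwargs) from an Action line, or None if there is none."""
    match = ACTION_RE.search(model_output)
    if match is None:
        return None
    name, raw_args = match.groups()
    return name, dict(KWARG_RE.findall(raw_args))

def react_step(transcript, model_output, tools):
    """Append the model output and, if it contains an Action, the tool's Observation.

    Returns (new_transcript, done), where done is True when no Action was emitted.
    """
    transcript += model_output.rstrip() + "\n"
    action = parse_action(model_output)
    if action is None:
        return transcript, True
    name, kwargs = action
    observation = tools[name](**kwargs)
    return transcript + f"Observation: {observation}\n", False
```

The returned transcript is what gets sent back to the model on the next call, so the model never produces the Observation line itself.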

ReWOO (Reasoning Without Observation)

ReWOO is a different prompt structure that asks the model to plan all tool calls upfront in a single pass, then executes them in batch, then asks the model to synthesize:

Plan:
  1. search_orders(user_id="12345") -> #result1
  2. get_product_details(order_id=#result1.latest) -> #result2

[Orchestrator executes both, injects results]

Synthesize: Based on #result1 and #result2, the answer is...

Trade-off: Fewer LLM calls (lower latency and cost), but the model cannot adapt its plan based on intermediate results.
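
The execution phase of such a plan can be sketched as below. This is a simplified illustration: the plan is assumed to be already parsed into dictionaries, a #resultN reference is replaced by the whole earlier result (not a field such as .latest), and the tool functions are hypothetical stubs.

```python
def execute_plan(plan, tools):
    """Run each planned step in order, substituting earlier results for #resultN references.

    No LLM call happens between steps; the plan is fixed before execution starts.
    """
    results = {}
    for step in plan:
        kwargs = {
            key: results.get(value, value) if isinstance(value, str) else value
            for key, value in step["args"].items()
        }
        results[step["out"]] = tools[step["tool"]](**kwargs)
    return results

# Hypothetical parsed plan and stub tools.
plan = [
    {"tool": "search_orders", "args": {"user_id": "12345"}, "out": "#result1"},
    {"tool": "get_product_details", "args": {"order_id": "#result1"}, "out": "#result2"},
]
tools = {
    "search_orders": lambda user_id: f"order-for-{user_id}",
    "get_product_details": lambda order_id: f"details-of-{order_id}",
}
results = execute_plan(plan, tools)
```

Because the second step depends on the first, the steps still run sequentially; the saving is that the model is consulted only for planning and for the final synthesis.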

The Key Insight

Neither ReAct nor ReWOO changes the underlying mechanism. They are different strategies for structuring the messages array that gets sent to the same llm.generate() call. The orchestrator code around the LLM is what differs.


Why the Loop Converges

A natural question: why doesn't the agent loop forever? Three mechanisms contribute to termination, but only the hard limits guarantee it:

1. Context Conditioning (The Primary Mechanism)

Each iteration appends new information (tool results, intermediate reasoning) to the message history, and the LLM's next-token distribution shifts with it: once the context contains the information the query needs, a final answer becomes more probable than another tool call.

This is not goal-directed behavior. It is the conditional probability shifting as the input context changes.

2. Hard Limits (Engineering Safeguard)

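Production implementations enforce hard caps such as a maximum iteration count (max_iterations in the loop above) and a token limit, so the loop ends even if the model never stops calling tools.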
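
A minimal sketch of such a safeguard follows. RunBudget and BudgetExceeded are hypothetical names, the default values are arbitrary, and the wall-clock timeout is an assumption added for illustration alongside the iteration and token limits.

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits one of its hard limits."""

class RunBudget:
    def __init__(self, max_iterations=10, max_tokens=50_000,
                 max_seconds=60.0, clock=time.monotonic):
        self.max_iterations = max_iterations
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.iterations = 0
        self.tokens = 0

    def charge(self, tokens_used):
        """Record one LLM call; raise if any limit is now exceeded."""
        self.iterations += 1
        self.tokens += tokens_used
        if self.iterations > self.max_iterations:
            raise BudgetExceeded("iteration limit reached")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit reached")
        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded("time limit reached")
```

Calling charge() once per loop iteration keeps the termination guarantee in ordinary code, independent of anything the model generates.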

3. Training Signal (Learned Behavior)

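During RLHF and instruction tuning, trajectories that end with a final answer once enough information is available are rewarded, so emitting an answer instead of another tool call is itself a learned token pattern.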


The Full Technical Stack

User Query
    |
    v
+------------------------------------------+
| Orchestrator (Python/TypeScript)         |
| - Manages message history                |
| - Parses LLM output for tool calls       |
| - Executes tools and appends results     |
| - Enforces iteration/token limits        |
+------------------------------------------+
    |
    v
+------------------------------------------+
| Prompt Assembly                          |
| - System prompt (persona, instructions)  |
| - Tool schemas (JSON Schema definitions) |
| - Message history (growing each loop)    |
| - User query                             |
+------------------------------------------+
    |
    v
+------------------------------------------+
| LLM Inference (API or local)             |
| 1. Tokenization (text -> token IDs)      |
| 2. Embedding lookup (IDs -> vectors)     |
| 3. N transformer layers:                 |
|    - Multi-head self-attention           |
|    - Feed-forward network                |
|    - Layer normalization                 |
| 4. Output projection -> logits           |
| 5. Softmax -> probability distribution   |
| 6. Sampling (temperature, top-p, etc.)   |
| 7. Repeat until stop token               |
+------------------------------------------+
    |
    v
+------------------------------------------+
| Response Parsing                         |
| - Tool call detected?                    |
|   YES -> Execute tool -> Append result   |
|          -> Loop back to Prompt Assembly |
|   NO  -> Return final text to user       |
+------------------------------------------+

Common Misconceptions

Misconception | Reality
"The agent decides to keep thinking" | The orchestrator code checks for tool calls and loops. The LLM influences the loop only through whether its output contains a tool call.
"Agents have memory" | The message history is re-sent as part of the prompt on each iteration. There is no persistent memory module unless explicitly implemented (e.g., vector store retrieval).
"Agents plan ahead" | The model generates one token at a time. Any apparent "planning" is emergent from training on planning-like text.
"ReAct/ReWOO are agent architectures" | They are prompt templates that structure how context is accumulated across loop iterations.
"The agent understands the tools" | The model has learned statistical associations between tool schemas, queries, and correct tool-call token sequences.

Practical Implications

  1. Debugging agents means debugging prompts and context. If an agent behaves incorrectly, inspect the full message history at each iteration. For fixed weights and sampling settings, the model's behavior is determined by its input.

  2. Framework choice matters less than prompt design. Since all frameworks reduce to the same loop, the differentiator is how effectively the prompt and tool schemas guide the model's token generation.

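  3. Cost and latency scale with iterations. Each loop iteration is a full LLM inference call, and the whole message history is sent again each time, so input tokens grow faster than linearly with the number of iterations. More iterations = more tokens processed = higher cost and latency. Optimize by providing better context upfront to reduce the number of iterations needed.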

  4. Context window is the hard constraint. As the message history grows each iteration, you approach the model's context limit. Long-running agents need strategies for context management (summarization, sliding windows, retrieval).
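
As one illustration of the context-management strategies in point 4, here is a minimal sliding-window sketch. trim_history is a hypothetical helper, and the default token counter (whitespace-separated words) is a crude assumption; a real implementation would use the model's own tokenizer.

```python
def trim_history(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep system messages plus the most recent other messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))
```

One caveat of this simple window: it can drop a tool call while keeping its result, or the reverse, so a production version would probably trim in call/result pairs or summarize the dropped turns.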


Summary

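An LLM-based agent is a next-token predictor called repeatedly by a conventional program loop, with tool results appended to a growing message history until the model emits a final answer or a hard limit is reached.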

The intelligence is in the transformer weights. The agent framework is plumbing: prompt formatting, tool execution, and loop control.